The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict their clients' repayment abilities.

Some of the challenges:

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB uncompressed

Kaggle API setup

Kaggle is a Data Science Competition Platform that shares a lot of datasets. In the past, it was troublesome to submit your results, as you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; it takes me less than 15 minutes to finish a submission.

  1. Install the library

For more detailed information on setting the Kaggle API see here and here.
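A typical setup sketch follows, assuming pip is available and you have downloaded an API token (kaggle.json) from your Kaggle account page; the token path below is a placeholder you must replace:

```shell
# Install the Kaggle CLI and put the API token where it expects it.
pip install --quiet kaggle || true               # ignore errors if already installed/offline
mkdir -p ~/.kaggle
# cp /path/to/kaggle.json ~/.kaggle/kaggle.json  # token from your Kaggle account page
chmod 700 ~/.kaggle                              # the CLI warns if permissions are too open
```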

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or fall victim to untrustworthy lenders.

The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).


Data files overview

There are 7 different sources of data:

  1. application_{train|test}.csv: the main data, one row per loan application
  2. bureau.csv: clients' previous credits from other financial institutions
  3. bureau_balance.csv: monthly balances of the previous credits in bureau
  4. previous_application.csv: clients' previous applications for Home Credit loans
  5. POS_CASH_balance.csv: monthly snapshots of previous point-of-sale and cash loans
  6. credit_card_balance.csv: monthly snapshots of previous credit cards
  7. installments_payments.csv: repayment history for previous Home Credit loans

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the competition's Data webpage and unzip the zip file to DATA_DIR
  2. If you plan to use the Kaggle API, please use the following steps.
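The download and unzip steps might look like the following sketch (guarded so the commands only run when the kaggle CLI is available, and assuming you have accepted the competition rules on the Kaggle website):

```shell
COMP="home-credit-default-risk"
DATA_DIR="../../../Data/home-credit-default-risk"   # same location as above
mkdir -p "$DATA_DIR"
if command -v kaggle >/dev/null 2>&1; then
    # Requires a configured API token and accepted competition rules.
    kaggle competitions download -c "$COMP" -p "$DATA_DIR"
    unzip -o "$DATA_DIR/$COMP.zip" -d "$DATA_DIR"
fi
```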

Imports

Data files overview

Data Dictionary

A Data Dictionary comes as part of the data download. It is named HomeCredit_columns_description.csv.
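Once the files are in DATA_DIR, the dictionary can be inspected with pandas; a sketch (the read is guarded so the cell also runs before the download, and the non-UTF-8 encoding argument is what this file typically needs):

```python
import os
import pandas as pd

DATA_DIR = "../../../Data/home-credit-default-risk"
dd_path = os.path.join(DATA_DIR, "HomeCredit_columns_description.csv")

if os.path.exists(dd_path):
    # The file contains non-UTF-8 characters, hence latin-1.
    data_dict = pd.read_csv(dd_path, encoding="latin-1")
    print(data_dict[["Table", "Row", "Description"]].head())
```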


Application train

Application test

The application dataset has the most information about the client: gender, income, family status, education, etc.

The other datasets

Exploratory Data Analysis

Summary of Application train

Missing data for application train

Distribution of the target column

Correlation with the target column

Correlation Heatmap

Applicants Gender Distribution

Applicants Age

Applicants Contract Types

Applicants Real Estate Owners

Applicants Income Sources

Applicants Occupation Types

Distribution of Occupation Types

Distribution of Amount AMT_CREDIT

Distribution of AMT_ANNUITY

Distribution of Applicants Family Members Count

Processing the Pipeline

Baseline Model

Adding Previous Application Data

Baseline Model with Previous Application Data

Performance Metrics

Submission

Submitting to Kaggle

Write-up

Abstract

The project we are working on is Home Credit Default Risk (HCDR). We aim to build a machine learning model that accurately assesses the risk for lenders. Multiple parameters need to be considered to optimally predict whether a client will default, such as occupation, credit history, age, location, credit card usage, cash balance, and others. When evaluating a loan application, we therefore look at these parameters to help financial organizations make the best decisions possible for their long-term business.

In this phase, with features from three of the eight datasets available, we tested classification algorithms such as Logistic Regression and Random Forest as our baseline models. We ran four experiments with the aforementioned models; the results are shown in the next sections of this presentation. We intend to develop a machine learning model that will allow Home Credit to accurately forecast repayment risk, allowing more people to obtain much-needed loans.

Finally, we discovered that Logistic Regression produced better results. The results we obtained are as follows:

As you can see from the results, Logistic Regression’s score went up to 0.7602 after merging the history of the client applications whereas Random Forest did not show any improvements.

Project Description

Home Credit is a prominent consumer-finance specialist in developing markets that has built a platform managing its core technology, product, and funding activities with local market needs in mind. Its target market is underserved borrowers in the blue-collar and junior white-collar segments who have a steady source of income from their jobs or micro-businesses but are less likely to get loans from banks and other traditional lenders. It is vital for financial organizations to assess whether their loan applicants will be able to repay their loans.

The data we intend to use comes from the Kaggle Home Credit Default Risk competition. The dataset has seven different files; the relationship between the files is shown in the tabular diagram below.

Introduction

Feature Engineering and transformers

Feature engineering includes both feature selection (retaining only the most significant features, or applying a dimensionality reduction technique) and feature creation (adding new features to the existing data). A variety of techniques will be used to build and pick features, such as using SK_ID_CURR as a key, applying aggregation functions, using groupby, and borrowing features from the previous application dataset.
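As an illustration of the aggregation idea (synthetic toy data; only the column names follow the HCDR schema), previous applications can be rolled up to one row per SK_ID_CURR and then merged back onto the main table:

```python
import pandas as pd

# Toy stand-in for previous_application.csv: two prior loans for client 1, one for client 2.
prev_app = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT": [1000.0, 3000.0, 500.0],
})

# Aggregate per client: mean prior credit amount and number of prior applications.
agg = (prev_app.groupby("SK_ID_CURR")["AMT_CREDIT"]
       .agg(["mean", "count"])
       .rename(columns={"mean": "PREV_AMT_CREDIT_MEAN",
                        "count": "PREV_APP_COUNT"})
       .reset_index())
print(agg)
```

The resulting frame can be left-joined to application_train on SK_ID_CURR, which is how the previous-application features enter the baseline models below.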

The features we used are listed below:

selected_features = ['SK_ID_CURR','AMT_INCOME_TOTAL', 'AMT_CREDIT','DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1', 'EXT_SOURCE_2','EXT_SOURCE_3','CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE','NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']

numerical_attributes = ['AMT_INCOME_TOTAL', 'AMT_CREDIT','DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1', 'EXT_SOURCE_2','EXT_SOURCE_3']

categorical_attributes = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']

Work Flow

Pipelines

Numerical pipeline definition:

numerical_pipeline = Pipeline([
    ('selector', DataFrameSelector(numerical_attributes)),
    ('imputer', SimpleImputer(strategy='mean')),
    ('std_scaler', StandardScaler()),
])

Categorical pipeline definition:

categorical_pipeline = Pipeline([
    ('selector', DataFrameSelector(categorical_attributes)),
    ('imputer', SimpleImputer(strategy='constant')),
    ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore")),
])

The two branches are combined and wrapped in a single preparation pipeline:

data_prep_pipeline = FeatureUnion(transformer_list=[
    ("numerical_pipeline", numerical_pipeline),
    ("categorical_pipeline", categorical_pipeline),
])

prepared_pipeline = Pipeline([("preparation", data_prep_pipeline)])
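DataFrameSelector is not a scikit-learn class; a minimal version consistent with the pipelines above (a sketch, not necessarily the exact class used in the notebook) could be:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns so a pipeline branch sees only its attributes."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return X[self.attribute_names].values

# Tiny demo: pull just the numerical column out of a mixed frame.
df = pd.DataFrame({"AMT_CREDIT": [1000.0, 2000.0], "CODE_GENDER": ["M", "F"]})
selected = DataFrameSelector(["AMT_CREDIT"]).fit_transform(df)
```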

Experimental results

With Logistic Regression and Random Forest we initially got test AUCs of 0.74 and 0.71, respectively. After merging with the previous application data, we saw a small rise in AUC for Logistic Regression, but no effect for the Random Forest model.
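The AUC figures here are ROC AUCs as computed by scikit-learn; for instance, on a toy set of labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Toy example: two negatives, two positives, with one mis-ranked pair,
# so 3 of the 4 (negative, positive) pairs are ordered correctly.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```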

Discussion

We ran four different experiments in this phase using the Logistic Regression and Random Forest algorithms. We started by conducting the experiment with the application train data and then made use of the application test dataset, without using supporting datasets. We observed an AUC of 0.7450 with Logistic Regression and 0.7151 with Random Forest. We then added additional features from the previous application data and observed that the Logistic Regression AUC increased to 0.7602, while there was no significant increase for Random Forest.

Conclusion

With the inclusion of features from the previous application dataset, the AUC of our baseline logistic regression model improved from 0.74 to 0.76, whereas there was no significant change for the random forest model. This shows that exploring new features is beneficial, and it emphasizes the significance of EDA in choosing the right features for our model. The next phases will focus on exploring and adding new features, hyperparameter tuning, and implementing new models and comparing the results to improve performance.

Kaggle Submission